Let’s start by loading the dplyr package:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Did you notice the messages that were printed when the package loaded? What’s going on there?
It turns out that the dplyr package has a function named filter(), but the stats package, which is automatically loaded when you start an R session, also has a function named filter()! So, if I type the command filter(dataset, ...), how does R know which filter() function to use?
R looks for the function filter() starting with the package that was loaded most recently, and going backwards in time. Since dplyr was the last package loaded, R will assume that we meant dplyr’s version of filter() and use that.
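If you’d like to see this lookup order for yourself, here is a minimal sketch. It uses base R’s search() function, which lists the attached packages in the order R looks through them; the variable names (search_path, dplyr_pos, stats_pos) are just illustrative.

```r
library(dplyr)

# search() lists attached packages in the order R looks through them
search_path <- search()

# dplyr was attached after stats, so it sits earlier on the search path
dplyr_pos <- match("package:dplyr", search_path)
stats_pos <- match("package:stats", search_path)
dplyr_pos < stats_pos

# The bare name filter therefore resolves to dplyr's version
environmentName(environment(filter))
```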
What if I meant the stats version of filter() instead? Is there a way that I can reference it? Yes! We can use “double colon” notation: stats::filter(). (The general syntax for this is packageName::functionName().)
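As a quick sketch of the double colon notation, the snippet below calls both versions of filter() explicitly. The inputs x and toy are made-up examples, not part of the flights data.

```r
library(dplyr)

# Toy inputs (hypothetical, for illustration only)
x <- c(1, 2, 3, 4, 5)
toy <- data.frame(dest = c("SFO", "LAX", "SFO"), delay = c(5, 10, -2))

# stats::filter() applies a linear filter to a numeric series;
# here the coefficients give a 2-term moving average
moving_avg <- stats::filter(x, rep(1 / 2, 2))

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
sfo_only <- dplyr::filter(toy, dest == "SFO")
sfo_only
```

Writing the package name out like this also makes it obvious to a reader which function is meant, even when nothing is masked.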
Today we’ll be working with the flights dataset from the nycflights13 package. Let’s load the nycflights13 package and the flights dataset (use install.packages("nycflights13") if you don’t have the package yet):
library(nycflights13)
data(flights)
Next, use the ?, str() and View() functions to examine the dataset:
?flights
str(flights)
View(flights)
This dataset contains ~336,000 flights that departed from New York City (all 3 airports) in 2013.
Next, just key in the dataset name (i.e. flights):
flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Did you notice that the output format is different from what we’ve seen before? That’s because previous datasets were in a data structure that we called data frames, while this is in a data structure called a tibble. Don’t worry about the difference: a tibble is a data frame with a tidier print method, and for our purposes in this session the two behave the same.
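To make the relationship concrete, here is a small sketch with a made-up data frame (df and tb are illustrative names). It converts the data frame to a tibble with as_tibble(), which dplyr re-exports from the tibble package, and checks that the result still counts as a data frame.

```r
library(dplyr)

# A small made-up data frame (hypothetical values)
df <- data.frame(carrier = c("UA", "AA"), air_time = c(361, 378))

# as_tibble() is re-exported by dplyr from the tibble package
tb <- as_tibble(df)

# A tibble is still a data frame, with extra classes layered on top
is.data.frame(tb)
class(tb)

# dplyr verbs accept either structure
long_df <- filter(df, air_time > 370)
long_tb <- filter(tb, air_time > 370)
```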
filter() and logical operations
Since we are here in Stanford, we may only be interested in flights from NYC to SFO. We can use the filter() verb to achieve this:
flights %>% filter(dest == "SFO")
## # A tibble: 13,331 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 558 600 -2 923
## 2 2013 1 1 611 600 11 945
## 3 2013 1 1 655 700 -5 1037
## 4 2013 1 1 729 730 -1 1049
## 5 2013 1 1 734 737 -3 1047
## 6 2013 1 1 745 745 0 1135
## 7 2013 1 1 746 746 0 1119
## 8 2013 1 1 803 800 3 1132
## 9 2013 1 1 826 817 9 1145
## 10 2013 1 1 1029 1030 -1 1427
## # ... with 13,321 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Note that we used == to test whether dest was equal to "SFO". DO NOT USE = here. In R, as in most programming languages, = means assignment (or names a function argument), not comparison.
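Here is a tiny sketch of the difference, using a made-up data frame called toy. The comparison with == produces one TRUE or FALSE per row, and filter() keeps the rows that are TRUE.

```r
library(dplyr)

# A made-up data frame (hypothetical values)
toy <- data.frame(dest = c("SFO", "JFK", "SFO"))

# `==` compares element by element and returns a logical vector
is_sfo <- toy$dest == "SFO"
is_sfo

# filter() keeps the rows where the comparison is TRUE
sfo_rows <- toy %>% filter(dest == "SFO")

# By contrast, `=` at the top level assigns a value to a name
dest_code = "SFO"
```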
There are two other international airports near Stanford, San Jose International Airport (“SJC”) and Oakland International Airport (“OAK”). So if we want to analyze flights that people take to get from NYC to Stanford, we should probably include flights to these airports as well:
flights %>% filter(dest == "SFO" | dest == "SJC" | dest == "OAK")
## # A tibble: 13,972 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 558 600 -2 923
## 2 2013 1 1 611 600 11 945
## 3 2013 1 1 655 700 -5 1037
## 4 2013 1 1 729 730 -1 1049
## 5 2013 1 1 734 737 -3 1047
## 6 2013 1 1 745 745 0 1135
## 7 2013 1 1 746 746 0 1119
## 8 2013 1 1 803 800 3 1132
## 9 2013 1 1 826 817 9 1145
## 10 2013 1 1 1029 1030 -1 1427
## # ... with 13,962 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
The command above filters the dataset and prints it out, but does not retain the output. To keep the extracted dataset for further analysis, we have to assign it to a variable:
Stanford <- flights %>% filter(dest == "SFO" | dest == "SJC" | dest == "OAK")
We now have flights from NYC to SFO/SJC/OAK for the entire year. Let’s say that I’m only interested in flights when school is in session (Sep - Jun). Since month is a numeric variable, we could do this:
Stanford %>% filter(month <= 6 | month >= 9)
## # A tibble: 11,351 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 558 600 -2 923
## 2 2013 1 1 611 600 11 945
## 3 2013 1 1 655 700 -5 1037
## 4 2013 1 1 729 730 -1 1049
## 5 2013 1 1 734 737 -3 1047
## 6 2013 1 1 745 745 0 1135
## 7 2013 1 1 746 746 0 1119
## 8 2013 1 1 803 800 3 1132
## 9 2013 1 1 826 817 9 1145
## 10 2013 1 1 1029 1030 -1 1427
## # ... with 11,341 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
or this:
Stanford %>% filter(month != 7 & month != 8)
## # A tibble: 11,351 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 558 600 -2 923
## 2 2013 1 1 611 600 11 945
## 3 2013 1 1 655 700 -5 1037
## 4 2013 1 1 729 730 -1 1049
## 5 2013 1 1 734 737 -3 1047
## 6 2013 1 1 745 745 0 1135
## 7 2013 1 1 746 746 0 1119
## 8 2013 1 1 803 800 3 1132
## 9 2013 1 1 826 817 9 1145
## 10 2013 1 1 1029 1030 -1 1427
## # ... with 11,341 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
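The two filters above give the same rows because “June or earlier, or September or later” is the same condition as “neither July nor August”. A small sketch on a made-up data frame with one row per month (toy is an illustrative name) confirms this:

```r
library(dplyr)

# One row per calendar month (made-up data)
toy <- data.frame(month = 1:12)

# Version 1: combine two ranges with OR (|)
in_session_a <- toy %>% filter(month <= 6 | month >= 9)

# Version 2: exclude two values with AND (&)
in_session_b <- toy %>% filter(month != 7 & month != 8)

# Both keep the same 10 months
identical(in_session_a$month, in_session_b$month)
```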
Let’s return to the Stanford dataset (i.e. all flights from NYC to SFO/SJC/OAK). Notice that we have a total of 19 variables. Sometimes our datasets will have hundreds or thousands of variables! Not all of them may be of interest to us. select() allows us to choose a subset of these variables to form a smaller dataset that may be easier to work with.
19 is a pretty small number so we could do our data analysis without dropping any columns, but let’s just try out some commands to get a feel for how select() works.
We can select columns by name: if we just want the year, month and day columns, we can use the following code:
Stanford %>% select(year, month, day)
## # A tibble: 13,972 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 13,962 more rows
If the columns we want form a contiguous block, then we can use simpler syntax. To select columns from year to arr_delay (inclusive):
Stanford %>% select(year:arr_delay)
## # A tibble: 13,972 x 9
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 558 600 -2 923
## 2 2013 1 1 611 600 11 945
## 3 2013 1 1 655 700 -5 1037
## 4 2013 1 1 729 730 -1 1049
## 5 2013 1 1 734 737 -3 1047
## 6 2013 1 1 745 745 0 1135
## 7 2013 1 1 746 746 0 1119
## 8 2013 1 1 803 800 3 1132
## 9 2013 1 1 826 817 9 1145
## 10 2013 1 1 1029 1030 -1 1427
## # ... with 13,962 more rows, and 2 more variables: sched_arr_time <int>,
## # arr_delay <dbl>
In this example, the year column is superfluous, since all the values are 2013. The code below drops the year column, keeping the rest:
Stanford %>% select(-year)
## # A tibble: 13,972 x 18
## month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <dbl> <int> <int>
## 1 1 1 558 600 -2 923 937
## 2 1 1 611 600 11 945 931
## 3 1 1 655 700 -5 1037 1045
## 4 1 1 729 730 -1 1049 1115
## 5 1 1 734 737 -3 1047 1113
## 6 1 1 745 745 0 1135 1125
## 7 1 1 746 746 0 1119 1129
## 8 1 1 803 800 3 1132 1144
## 9 1 1 826 817 9 1145 1158
## 10 1 1 1029 1030 -1 1427 1355
## # ... with 13,962 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
select() can also be used to rearrange the columns. If, for example, I wanted to have the first 3 columns be day, month, year instead of year, month, day:
Stanford %>% select(day, month, year, everything())
## # A tibble: 13,972 x 19
## day month year dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 1 1 2013 558 600 -2 923
## 2 1 1 2013 611 600 11 945
## 3 1 1 2013 655 700 -5 1037
## 4 1 1 2013 729 730 -1 1049
## 5 1 1 2013 734 737 -3 1047
## 6 1 1 2013 745 745 0 1135
## 7 1 1 2013 746 746 0 1119
## 8 1 1 2013 803 800 3 1132
## 9 1 1 2013 826 817 9 1145
## 10 1 1 2013 1029 1030 -1 1427
## # ... with 13,962 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
To rename columns, use the rename() function:
Stanford %>% rename(tail_num = tailnum)
## # A tibble: 13,972 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 558 600 -2 923
## 2 2013 1 1 611 600 11 945
## 3 2013 1 1 655 700 -5 1037
## 4 2013 1 1 729 730 -1 1049
## 5 2013 1 1 734 737 -3 1047
## 6 2013 1 1 745 745 0 1135
## 7 2013 1 1 746 746 0 1119
## 8 2013 1 1 803 800 3 1132
## 9 2013 1 1 826 817 9 1145
## 10 2013 1 1 1029 1030 -1 1427
## # ... with 13,962 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tail_num <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
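Since the printed output only shows the first few columns, it can be easier to see what select() and rename() do by checking column names on a tiny example. The sketch below uses a made-up four-column data frame (toy); only the column names matter here.

```r
library(dplyr)

# A made-up data frame with four columns (hypothetical values)
toy <- data.frame(year = 2013, month = 1:3, day = c(5, 6, 7),
                  tailnum = c("N1", "N2", "N3"))

by_name   <- toy %>% select(month, day)                      # pick by name
by_range  <- toy %>% select(year:day)                        # contiguous block
dropped   <- toy %>% select(-year)                           # drop one column
reordered <- toy %>% select(day, month, year, everything())  # move to front
renamed   <- toy %>% rename(tail_num = tailnum)              # new = old

names(reordered)
names(renamed)
```

Note that rename() takes the new name on the left and the old name on the right.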
Often we get datasets which are not in order, or in an order which we are not interested in. The arrange() function allows us to reorder the rows according to an order we want.
The Stanford dataset looks like it is already ordered by actual departure time. Perhaps I’m most interested in the flights which had the longest departure delay. I could sort the dataset as follows:
Stanford %>% arrange(dep_delay)
## # A tibble: 13,972 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 12 11 710 730 -20 1039
## 2 2013 11 16 712 730 -18 1025
## 3 2013 9 11 712 730 -18 946
## 4 2013 11 19 713 730 -17 1036
## 5 2013 7 14 1151 1208 -17 1450
## 6 2013 12 10 714 730 -16 1104
## 7 2013 3 29 1050 1106 -16 1359
## 8 2013 4 20 1420 1436 -16 1737
## 9 2013 5 20 719 735 -16 951
## 10 2013 1 23 545 600 -15 948
## # ... with 13,962 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Looks like the flights with the shortest delay are at the top instead! To sort in descending order, use desc():
Stanford %>% arrange(desc(dep_delay))
## # A tibble: 13,972 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 9 20 1139 1845 1014 1457
## 2 2013 7 7 2123 1030 653 17
## 3 2013 7 7 2059 1030 629 106
## 4 2013 7 6 149 1600 589 456
## 5 2013 7 10 133 1800 453 455
## 6 2013 7 10 2342 1630 432 312
## 7 2013 7 7 2204 1525 399 107
## 8 2013 7 7 2306 1630 396 250
## 9 2013 6 23 1833 1200 393 NA
## 10 2013 7 10 2232 1609 383 138
## # ... with 13,962 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
(Wow, that’s a really long delay! Almost 17 hours.) To extract just the flights with the top 10 departure delays, we can use the head() function:
Stanford %>%
arrange(desc(dep_delay)) %>%
head(n = 10)
## # A tibble: 10 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 9 20 1139 1845 1014 1457
## 2 2013 7 7 2123 1030 653 17
## 3 2013 7 7 2059 1030 629 106
## 4 2013 7 6 149 1600 589 456
## 5 2013 7 10 133 1800 453 455
## 6 2013 7 10 2342 1630 432 312
## 7 2013 7 7 2204 1525 399 107
## 8 2013 7 7 2306 1630 396 250
## 9 2013 6 23 1833 1200 393 NA
## 10 2013 7 10 2232 1609 383 138
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
arrange() also allows us to sort by more than one column, in that each additional column will be used to break ties in the values of the preceding ones. For example, flights seems to be sorted by year, month, day, and actual departure time. If I wanted to sort by year, month, day and scheduled departure time instead:
Stanford %>% arrange(year, month, day, sched_dep_time)
## # A tibble: 13,972 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 558 600 -2 923
## 2 2013 1 1 611 600 11 945
## 3 2013 1 1 655 700 -5 1037
## 4 2013 1 1 729 730 -1 1049
## 5 2013 1 1 734 737 -3 1047
## 6 2013 1 1 745 745 0 1135
## 7 2013 1 1 746 746 0 1119
## 8 2013 1 1 803 800 3 1132
## 9 2013 1 1 826 817 9 1145
## 10 2013 1 1 1029 1030 -1 1427
## # ... with 13,962 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
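The first 10 rows above happen to look the same as before, so here is a small sketch where the tie-breaking is easier to see. The data frame toy is made up: two days, with two scheduled departure times each.

```r
library(dplyr)

# Made-up data: two flights on each of two days
toy <- data.frame(day = c(2, 1, 1, 2),
                  sched_dep_time = c(900, 1200, 600, 700))

# Sort by day first; sched_dep_time only decides the order within a day
sorted <- toy %>% arrange(day, sched_dep_time)
sorted

# desc() can be applied to individual columns:
# latest day first, but earliest departure first within each day
sorted_desc <- toy %>% arrange(desc(day), sched_dep_time)
sorted_desc
```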
In this dataset we have both the time the plane spent in the air (air_time, in minutes) and distance traveled (distance, in miles). From these two pieces of information, we can figure out the average speed of the plane for the flight using mutate().
mutate() adds new columns to the end of the dataset, so let’s work with a smaller dataset for now so that we can see the values of our new column.
Stanford_small <- Stanford %>%
select(month, carrier, origin, dest, air_time, distance) %>%
mutate(speed = distance / air_time * 60)
Stanford_small
## # A tibble: 13,972 x 7
## month carrier origin dest air_time distance speed
## <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 UA EWR SFO 361 2565 426.
## 2 1 UA JFK SFO 366 2586 424.
## 3 1 DL JFK SFO 362 2586 429.
## 4 1 VX JFK SFO 356 2586 436.
## 5 1 B6 JFK SFO 350 2586 443.
## 6 1 AA JFK SFO 378 2586 410.
## 7 1 UA EWR SFO 373 2565 413.
## 8 1 UA JFK SFO 369 2586 420.
## 9 1 UA EWR SFO 357 2565 431.
## 10 1 AA JFK SFO 389 2586 399.
## # ... with 13,962 more rows
mutate() can be used to create several new variables at once, and later variables can refer to ones created earlier in the same call. For example, the following code is valid syntax:
Stanford_small %>% mutate(speed_miles_per_min = distance / air_time,
speed_miles_per_hour = speed_miles_per_min * 60)
## # A tibble: 13,972 x 9
## month carrier origin dest air_time distance speed speed_miles_per…
## <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 UA EWR SFO 361 2565 426. 7.11
## 2 1 UA JFK SFO 366 2586 424. 7.07
## 3 1 DL JFK SFO 362 2586 429. 7.14
## 4 1 VX JFK SFO 356 2586 436. 7.26
## 5 1 B6 JFK SFO 350 2586 443. 7.39
## 6 1 AA JFK SFO 378 2586 410. 6.84
## 7 1 UA EWR SFO 373 2565 413. 6.88
## 8 1 UA JFK SFO 369 2586 420. 7.01
## 9 1 UA EWR SFO 357 2565 431. 7.18
## 10 1 AA JFK SFO 389 2586 399. 6.65
## # ... with 13,962 more rows, and 1 more variable:
## # speed_miles_per_hour <dbl>
If we only want to keep the newly created variables, use transmute() instead of mutate().
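As a quick sketch of the difference, the snippet below runs the same speed calculation through both verbs on a made-up two-row data frame (toy), where air_time is in minutes and distance is in miles as in the flights data.

```r
library(dplyr)

# Made-up flights: air_time in minutes, distance in miles
toy <- data.frame(air_time = c(360, 300), distance = c(2586, 2400))

# mutate() keeps the existing columns and appends the new one
with_mutate <- toy %>% mutate(speed = distance / air_time * 60)
names(with_mutate)

# transmute() keeps only the columns created in the call
only_new <- toy %>% transmute(speed = distance / air_time * 60)
names(only_new)
```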
Let’s make use of our plotting skills from last session to see if there are any trends in air time. First, let’s make a histogram of air_time:
library(ggplot2)
ggplot(data = Stanford_small) +
geom_histogram(aes(x = air_time))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 162 rows containing non-finite values (stat_bin).
Did you notice the warning message about rows being removed for “containing non-finite values”? If you view the Stanford_small dataset and scroll all the way down, you’ll notice that there are some rows which have NA for air_time. Since we don’t know what the air time is, we can’t compute the speed and we can’t plot it.
As a data analyst, you should watch out for NAs, as they could invalidate your analysis. Why are these data missing? Is it completely at random, or is there something going on? For this session, we will just leave them in the dataset.
It seems like the air time of planes might vary depending on the origin and destination, so let’s facet on these 2 variables:
ggplot(data = Stanford_small) +
geom_histogram(aes(x = air_time)) +
facet_grid(origin ~ dest)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 162 rows containing non-finite values (stat_bin).
We learn 3 things from this plot: (i) there are no flights from LaGuardia (LGA) to any of the 3 airports; (ii) there are no flights from Newark (EWR) to SJC/OAK; and (iii) there are very few flights from NYC to SJC/OAK compared to SFO. It’s hard to tell if there are differences in the distributions; the optional material section explores this question further.
Instead of looking at plots, we can try to look at summary statistics instead. What was the mean/median air time for flights in our Stanford_small dataset? We can use the summarize() function to help us:
Stanford_small %>% summarize(mean_airtime = mean(air_time))
## # A tibble: 1 x 1
## mean_airtime
## <dbl>
## 1 NA
Stanford_small %>% summarize(median_airtime = median(air_time))
## # A tibble: 1 x 1
## median_airtime
## <dbl>
## 1 NA
The NAs are causing us trouble! We need to specify the na.rm = TRUE option to remove NAs from consideration:
Stanford_small %>% summarize(mean_airtime = mean(air_time, na.rm = TRUE))
## # A tibble: 1 x 1
## mean_airtime
## <dbl>
## 1 346.
Stanford_small %>% summarize(median_airtime = median(air_time, na.rm = TRUE))
## # A tibble: 1 x 1
## median_airtime
## <dbl>
## 1 345
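Here is the same behaviour on a made-up column with one missing value (toy is an illustrative name). It can also be handy to count how many values were dropped, which the sketch does with sum(is.na(...)).

```r
library(dplyr)

# Made-up air times with one missing value
toy <- data.frame(air_time = c(350, NA, 370))

# A single NA makes the mean NA
with_na <- toy %>% summarize(mean_airtime = mean(air_time))

# na.rm = TRUE drops the NA before averaging;
# sum(is.na(...)) counts how many values were missing
without_na <- toy %>%
  summarize(mean_airtime = mean(air_time, na.rm = TRUE),
            n_missing = sum(is.na(air_time)))
without_na
```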
summarize() gives me a summary of the entire dataset. If I want summaries by group, then I have to use summarize() in conjunction with group_by(). group_by() changes the unit of analysis from the whole dataset to individual groups. The following code groups the dataset by carrier, then computes the summary statistic for each group:
Stanford_small %>%
group_by(carrier) %>%
summarize(mean_airtime = mean(air_time, na.rm = TRUE)) %>%
arrange(desc(mean_airtime))
## # A tibble: 5 x 2
## carrier mean_airtime
## <chr> <dbl>
## 1 AA 348.
## 2 VX 348.
## 3 DL 347.
## 4 B6 347.
## 5 UA 344.
I can also group by more than one variable. For example, if I wanted to count the number of flights for each carrier in each month, I could use the following code:
Stanford_small %>%
group_by(month, carrier) %>%
summarize(count = n())
## # A tibble: 60 x 3
## # Groups: month [?]
## month carrier count
## <int> <chr> <int>
## 1 1 AA 120
## 2 1 B6 121
## 3 1 DL 142
## 4 1 UA 422
## 5 1 VX 124
## 6 2 AA 108
## 7 2 B6 106
## 8 2 DL 127
## 9 2 UA 378
## 10 2 VX 104
## # ... with 50 more rows
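To check the counting logic by hand, here is a sketch on a made-up five-row data frame (toy). It also shows dplyr’s count() function, which is a shortcut for this particular group-and-count pattern; note that count() names its result column n rather than count.

```r
library(dplyr)

# Made-up data: five flights across two months and two carriers
toy <- data.frame(month = c(1, 1, 1, 2, 2),
                  carrier = c("UA", "UA", "AA", "UA", "AA"))

# One row per (month, carrier) combination, with the number of flights
by_group <- toy %>%
  group_by(month, carrier) %>%
  summarize(count = n())
by_group

# count() does the grouping and counting in one step
shortcut <- toy %>% count(month, carrier)
shortcut
```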
We can even “pipe” the dataset to ggplot() to plot the data!
Stanford_small %>%
group_by(month, carrier) %>%
summarize(count = n()) %>%
ggplot(mapping = aes(x = month, y = count, col = carrier)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = 1:12)
%in% operator
Recall that we used the following line of code to extract flights that landed in SFO, SJC or OAK:
Stanford <- flights %>% filter(dest == "SFO" | dest == "SJC" | dest == "OAK")
We can use the %in% operator to make the code more succinct:
flights %>% filter(dest %in% c("SFO", "SJC", "OAK"))
## # A tibble: 13,972 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 558 600 -2 923
## 2 2013 1 1 611 600 11 945
## 3 2013 1 1 655 700 -5 1037
## 4 2013 1 1 729 730 -1 1049
## 5 2013 1 1 734 737 -3 1047
## 6 2013 1 1 745 745 0 1135
## 7 2013 1 1 746 746 0 1119
## 8 2013 1 1 803 800 3 1132
## 9 2013 1 1 826 817 9 1145
## 10 2013 1 1 1029 1030 -1 1427
## # ... with 13,962 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
The %in% operator is very useful, especially when we are checking if dest belongs to a long list of airports.
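For a long list, it helps to store the airport codes in a vector first and then test membership against it. The sketch below does this on a made-up data frame (toy and bay_area are illustrative names) and checks that it matches the chained == version.

```r
library(dplyr)

# Made-up destinations
toy <- data.frame(dest = c("SFO", "LAX", "OAK", "SJC", "BOS"))

# Keep the list of airports of interest in one place
bay_area <- c("SFO", "SJC", "OAK")

# Chained comparisons with OR
with_or <- toy %>% filter(dest == "SFO" | dest == "SJC" | dest == "OAK")

# The same filter using %in%
with_in <- toy %>% filter(dest %in% bay_area)

identical(with_or$dest, with_in$dest)
```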
Let’s remove the rows where air_time is NA:
Stanford_small <- Stanford_small %>%
filter(!is.na(air_time))
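Why is.na() rather than a comparison with ==? Comparing anything with NA gives NA, and filter() only keeps rows where the condition is TRUE. A small sketch on made-up data (toy is an illustrative name):

```r
library(dplyr)

# Made-up air times with one missing value
toy <- data.frame(air_time = c(350, NA, 370))

# Comparing with NA yields NA for every row, so nothing is kept
wrong <- toy %>% filter(air_time == NA)
nrow(wrong)

# is.na() returns TRUE/FALSE, so negating it keeps the non-missing rows
kept <- toy %>% filter(!is.na(air_time))
kept
```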
One theory we might have is that different carriers have different air times. Let’s do a facet on carrier:
ggplot(data = Stanford_small) +
geom_histogram(aes(x = air_time)) +
facet_grid(carrier ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The first thing we notice is that UA has many more flights than the other carriers. Because all 5 histograms have the same y-axis, this causes the other histograms to be obscured. To allow each histogram to have its own y-axis, we can add a scales argument to facet_grid():
ggplot(data = Stanford_small) +
geom_histogram(mapping = aes(x = air_time)) +
facet_grid(carrier ~ ., scales = "free_y")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As you can see, the histograms have very similar shapes, suggesting that the air times of the various carriers are roughly the same. The one thing that we might notice is the tails on the right.
A plot that is increasing in popularity for plotting multiple histograms or density plots is the joy plot (also called a ridgeline plot). The plot looks like a series of overlapping mountain ranges which can be compared against each other more easily than the histograms. The code below produces a joy plot:
library(ggridges)
##
## Attaching package: 'ggridges'
## The following object is masked from 'package:ggplot2':
##
## scale_discrete_manual
ggplot(data = Stanford_small, aes(x = air_time, y = carrier)) +
geom_density_ridges(scale = 5)
## Picking joint bandwidth of 3.24
(Play around with the scale parameter and see what happens.)
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggridges_0.5.0 ggplot2_3.0.0 bindrcpp_0.2.2
## [4] nycflights13_1.0.0 dplyr_0.7.6 knitr_1.20
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.17 pillar_1.3.0 compiler_3.5.1 plyr_1.8.4
## [5] bindr_0.1.1 tools_3.5.1 digest_0.6.15 evaluate_0.10.1
## [9] tibble_1.4.2 gtable_0.2.0 pkgconfig_2.0.1 rlang_0.2.1
## [13] cli_1.0.0 yaml_2.1.19 withr_2.1.2 stringr_1.3.1
## [17] rprojroot_1.3-2 grid_3.5.1 tidyselect_0.2.4 glue_1.2.0
## [21] R6_2.2.2 fansi_0.2.3 rmarkdown_1.10 purrr_0.2.5
## [25] reshape2_1.4.3 magrittr_1.5 backports_1.1.2 scales_0.5.0
## [29] htmltools_0.3.6 assertthat_0.2.0 colorspace_1.3-2 labeling_0.3
## [33] utf8_1.1.4 stringi_1.2.3 lazyeval_0.2.1 munsell_0.5.0
## [37] crayon_1.3.4